Skip to content

Conversation

medzin
Copy link
Contributor

@medzin medzin commented May 6, 2025

Summary

This commit adds logic to wait 500ms between subsequent calls of run_once if it returns an error. This is required to prevent spamming the SQS API if it fails to receive messages, e.g. due to invalid IAM permissions or an SQS outage.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • The CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
      • ./scripts/check_changelog_fragments.sh
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

References

@medzin medzin requested a review from a team as a code owner May 6, 2025 21:26
@bits-bot
Copy link

bits-bot commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label May 6, 2025
This commit adds logic to wait 500ms between subsequent calls of
run_once if it returns an error. This is required to prevent spamming
the SQS API if it fails to receive messages, e.g. due to invalid IAM
permissions or an SQS outage.
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @medzin, I understand the problem you are trying to fix here. Left a question.

Ok(()) => {}
Err(_) => {
trace!("run_once failed, will retry after delay");
tokio::time::sleep(delay).await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how this constant was chosen. Would an exponential backoff strategy work better here or it's an overkill?

Copy link
Contributor Author

@medzin medzin Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was treating this code as a hotfix rather than a long-term solution, so I chose a simple constant delay to quickly address the issue. The current 0.5-second delay doesn't introduce significant latency but greatly reduces the number of retries triggered by the source, which helps stabilize things short-term. That said, I agree a constant delay may not be ideal across all use cases - different values might work better for different users. An exponential backoff could offer a more flexible and resilient solution without needing to make the delay configurable. Happy to explore that in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this improves things in your situation but we generally don't want to merge hotfixes. If you have urgent need for this commit, you can always maintain a fork with this commit and use a custom Vector build.

Going back to this PR, what do you think about introducing a new opt-in config parameter here? We can probably reuse ExponentialBackoff from src/sinks/util/retries.rs.

Example:

  • 1st retry: 1 second
  • 2nd retry: 2 seconds
  • 3rd retry: 4 seconds
  • 4th retry: 8 seconds
  • 5th retry: 16 seconds
  • 6th retry: 30 seconds (capped)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will check it.

@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jun 25, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jul 9, 2025
@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jul 9, 2025
@pront pront force-pushed the master branch 4 times, most recently from 1720078 to ffe54be Compare July 10, 2025 15:43
@pront
Copy link
Member

pront commented Oct 13, 2025

Couldn't commit to this branch due to perms, moved to: #23996

@pront pront closed this Oct 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: sources Anything related to the Vector's sources meta: awaiting author Pull requests that are awaiting their author.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

No retry backoff for sqs in aws_s3 source

3 participants